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Abstract 

This paper applies conformal prediction techniques to compute simultane- 
ous prediction bands and clustering trees for functional data. These tools 
can be used to detect outliers and clusters. Both our prediction bands 
and clustering trees provide prediction sets for the underlying stochastic 
process with a guaranteed finite sample behavior, under no distributional 
assumptions. The prediction sets are also informative in that they cor- 
respond to the high density region of the underlying process. While or- 
dinary conformal prediction has high computational cost for functional 
data, we use the inductive conformal predictor, together with several 
novel choices of conformity scores, to simplify the computation. Our 
methods are illustrated on some real data examples. 



1 Introduction 



Functional data analysis has been the focus of much research efforts in the statistics 
and machine learning community in the last decade. In functional data analysis, the 
data points are functions rather than scalars or vectors. The functional perspective 
provides a powerful modeling tool for many natural processes with certain smoothness 
structures. Typical examples arise in temporal-spatial statistics, longitudinal data, 
genetics, and engineering. The literature on functional data analysis is growing very 
quickly and familiar techniques like regression, classification, principal component 
analysis and clustering have all been extended to functional data. The books by 



Ramsay & Silverman (1997) and Ferraty & Vieu (2006) have established the state of 



the art. 

The focus of this paper is exploratory analysis and visualization for functional 
data, including detection of outliers, median sets, and high density sets. Due to 
its infinite dimensional nature, visualization is a challenging problem for functional 
data since it is hard to see directly which curves are typical and which are abnor- 
mal. In the literature, many classical notions and tools for ordinary vector data have 
been extended to functional data, using either finite dimensional projection methods 
such as functional principal components (FPCA) and spline basis, or componentwise 
methods. For example, Hyndman & Shang (2010) developed functional bagplots 
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and boxplots, where a band in functional space is obtained by first applying bivariate 



bagplots Rousseeuw et al. ( 1999 ) on the first two principal component scores, and pro- 



jecting the bivariate sets back onto the functional space. In Sun & Genton (2011 ), the 



sample functions are ordered using a notion of functional data depth Lopez-Pintado 



& Romo (2009) and a band is constructed by taking componentwise maximum and 
minimum of sample quantile functions. Other related work in outlier detection mostly 
use functional data depth to order the curves, and then apply robust estimators to 
detect outliers, 



see 



Cuevas et al. (2007), Febrero et al. (2008), Gervini (2009). 



In this paper, we visualize functional quantiles in the form of simultaneous pre- 
diction bands. That is, for a given level of coverage, we construct a band that covers 
a random curve drawn from the underlying process. Our prediction bands are con- 
structed by combining the finite dimensional projection approach and the idea of 
conformal prediction, a general approach to construct distribution free finite sample 



vahd prediction sets Vovk et al. (2005); Shafer & Vovk (2008). Although originally 



developed as a tool for online prediction, conformal prediction methods have been 
proven to be useful for general nonparametric inferences, yielding robust and efficient 



prediction sets Lei et al. (2013); Lei & Wasserman (2013). To apply conformal meth- 



ods to functional data, a major challenge is computation efficiency. Computing the 
bands with finite dimensional projection is infeasible using ordinary conformal predic- 
tion methods because the conformal prediction sets are usually hard to characterize. 
Here we use the inductive conformal method which allows efficient implementation 
for functional data when combined with some carefully chosen conformity scores. 
The resulting prediction bands always give correct finite sample coverage, without 
any regularity conditions on the underlying process. Unlike many existing functional 
boxplots, our bands tend to reflect the "high density" regions in the functional spaces 
and may have disconnected slices. In some cases, such a property can also reveal 
other salient structures in the data such as clusters. 

Another contribution of this paper is the construction of clustering trees with fl- 
nite sample interpretation for functional data. In classical vector cases, clustering, 
together with the aforementioned prediction sets, is closely related to density level 
sets iHartiganl (|1975l) ; iRinaldo & Wassermanl (|2010l) ; iRinaldo et al.l (|2012l) . However, 



in functional spaces, the density is not well deflned because there is no cr-flnite dom- 
inating measure. In Ferraty & Vieu (2006), a notion of "pseudo density" is used 



instead of density. In context of functional data, most methods use flnite dimensional 
projections, combined with classical clustering methods such as K-means [Tarpey ^ 



Kinateder (2003); Antoniadis et al. (2010) and Gaussian mixtures James & Sugar 



(2003); Shi & Wang (2008). A componentwise approach to functional data clustering 



is studied by Delaigle et al. (2012). In this paper we construct clustering trees for 



functional data by combining the pseudo density idea Ferraty & Vieu (2006) and 



conformal prediction. Each level of our cluster tree corresponds to an estimated pre- 
diction set, with distribution free, flnite sample coverage. The clustering trees are 
therefore naturally indexed by the level of coverage. Moreover, the prediction sets at 
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a higher level on the tree correspond to regions with higher pseudo density. 

To our knowledge, this paper is the first to do prediction and visualization of 
functional data with finite sample guarantees. In Section |2] we introduce the problem 
of prediction bands for functional data and the general idea of conformal prediction 
method under this context. In Section |3] we develop the extended conformal prediction 
and apply it to efficiently construct simultaneous prediction bands for function data 
with good finite sample property. In Section |4] we develop our method of conformal 
cluster tree. Both methods are illustrated through real data examples. 



2 Notation and Background 

2.1 Prediction Sets for Functional Data 

Let Xi(-), ■ ■ ■ , Xn{-) denote n random functions drawn from an unknown distribution 
P over the set of functions on [0, 1] with finite energy: Q = L2[0, 1]. The distribution P 
is defined on an appropriate a-field J-" on Q. Given the data and a number < a < 1, 
we will construct a set of functions C„ C ^2 such that, for all P and all n, 

P(X„+i G C„) > 1 - a (1) 

where Xn+i denotes a future function drawn from P and P denotes the probability 
corresponding to P"+i, the product measure induced by P. 

Requiring ([T]) to hold is very ambitious and may be more than needed. If we are 
only interested in the main structural features of the curve, then we instead aim for 
the more modest goal that, for all P and all n, 

P(n(x„+i) e c„) > 1 - « (2) 

where 11 is a mapping into a finite dimensional function space Qp C Q. The projection 
n may correspond to the subspace spanned by the first few functions in Fourier basis, 
wavelet basis, or any other orthonormal basis of L2[0, 1]. 

In practice it is often of interest (for example, for visualization purposes) to have 
prediction bands. A prediction band Bn is a prediction set of the form Bn = G 
L2[0, 1] : X{t) G Bn(t), V t G [0, 1]}, where for each t, Bn{t) C M} can be expressed as 
the union of finitely many intervals. Most existing prediction bands for functional data 



with provable coverage relies on the assumption of Gaussianity (see, for example, Yao 



et al. (2005)). It is desirable to construct distribution free, finite sample prediction 



bands for functional data under general distributions. 



2.2 Conformal Prediction Methods and Variants 



Conformal inference Vovk et al. (2005); Shafer & Vovk (2008) is a very general the- 
ory of prediction, focused on sequential prediction. For our purposes, we only need 
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the following batch version. Given the observed objects (random variables, random 
vectors, random functions, etc.) Xi, . . . , X„ we want a prediction set for a new object 
Xn+i- Assume that Xi E Q and let x E Q he a fixed, arbitrary object. We will test 
the null hypothesis that Xn+i = x and then take Cn to be the set of x's that are not 
rejected by the test. Here are the details. 

Define the augmented data aug(x) = {Xi, . . . ,X„,x}. Define conformity scores 



CTi, . . . , (Tn+i where ai = g{Xi, aug(x)) for some function g. (Actually, in Vovk et al. 



(2005) they omit Xj from g when defining a,. We discuss this point at the end of 
this section.) For i = n + 1, the score is defined to be cr^+i = g{x,aug{x)). The 
conformity score measures how similar Xj is to the rest of the data. An example is 
ai = - J{Xi{t) - X\t)fdt where 

X\t) = {n + l)-^[x{t) + Y.t,Ut)]. 

In fact, we will use some other conformity scores that are better adapted to the data 
distribution. 

Consider testing the null hypothesis Hq : X„+i = x. When Hq is true, the 
objects in the set aug(x) = {Xi, . . . , X„, x} are exchangeable and so the ranks of the 
conformity scores are uniformly distributed. Thus, the p-value 

vr(x) = ^"=i' ^(^- ^ ^"+^) (3) 

is uniformly distributed over {l/(n + l), 2/ (n + !),...,!} and is a valid p-value for the 
test in the sense that P[7r„(X„4.i) > a] > 1 — a. We now invert the test, that is, we 
collect all the non-rejected null hypotheses: 

Cn = {x : 7r(x) > a}. (4) 

It follows from the above argument that 

P(X„+i e C„) > 1 - a (5) 

for any P and n. We refer to the above construction as the standard conformal 
method. Next we discuss the use of inductive conformal method which, as we shall 
see, has some computational advantages. The general idea is first given in Section 



4.1 of Vovk et al (2005). 
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Algorithm 1: Inductive Conformal Predictor 



Input: Data Xi, ...,X„, confidence level 1 — a, rii < n. 
Output: Cn- 

1. Split data randomly into two parts Xi = {Xi, ...X^}, and 
X2 = {Xni+i, ■■■,Xn}. Let ^2 = n - Ui. 

2. Let : Q i-> R be a function constructed from Xi, . . . , X^ ■ 

3. Define ai = g{Xn-^^i) for i = 1, . . . , n2- Let (T(i) < • • • < 
(J(n2) denote the ranked values. 

4. Let Cn = {x : g{x) > A} with A = a(i(n2+i)a]-i)- 



Lemma 2.1. Let Cn be the output of Algorithm 1. For all n and P, P(X„+i e Cn) > 
1 — a. 

Proof. Note that the random function g{-) is independent of X2. Assume Xn+i is 
another random sample. By exchangeability, the rank of o"„2+i •= 9{^n+i) among 
{(Ti, cr„2+i} is uniform among {1, 2, 712 + 1}. Therefore with probability at least 
1 — a, Xn+i falls in Cn- □ 

In the rest of this paper, unless otherwise noted, we use rii = [n/2\. The data 
splitting step might seem inefficient due to the sample splitting. However, it greatly 
reduces the computational burden of the standard conformal method by avoiding re- 
fitting the function g with every augmented data aug(a;) for all x E Q. Moreover, such 
a reduction can also substantially improve the robustness of the resulting prediction 
sets. For example, consider /c-means clustering in d-dimensional Euclidean space. Let 
Z = (Zi, Zn) C M*^ be a data set. Given k centers vi,. . . ,Vk C M*^, let 



The fc-means prototypes are the functions Vi. . . . . Vk that minimize R. As usual, 
we can partition the data into k groups Gi, . . . ,Gk where Gj = {Z^ : — Vj\\ < 
\\Zi-Vr\\,r ^ j}. 

For the standard conformal method, let Vi{z), . . . ,Vk{z) denote the prototypes 
based on the augmented data Zi, . . . , Zn, z. Define the conformity score ai — — min^ \\Zi— 
Vj{z)\\. Let Cn denote the resulting conformal set. With obvious modification, 
let C'j denote the conformal set based on the modified method. Define diam(C) = 



s^Px,yec\\^ -y\\- 

Proposition 2.1. For any Z, diam(C„) = 00 but diam(C^) < 00 if n > 2/a. 
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Proof. For the first statement, consider the augmented data Zi, . . . , Z„, 2;, where 
II -2 II > C. For C sufficiently large (may depend on Z), there exists a group Gj 
such that Gj = {z}. In this case, an+i = and hence Cn ^ {z : \\z\\ > C} for all a. It 
follows that diam(C„) = 00. The second statement follows since all the prototypes are 
in convhull(Zi) C convhull(Z). For all a, there exists a C > large enough such that 
g{z) < (Ti for all ||z|| > G. The key is that in the modified method, the prototypes 
and the majority of o"j's are not affected by absurd values of z. □ 

Remark 2.1. Bounded prediction sets could also be obtained by omitting Yi from g 



when defining ai. This is, in fact, the method used in Vovk et al. (2005). But this 
"omit one" approach is much more computationally expensive than our data-splitting 
method. 



2.3 Conformal Prediction Bands in the Functional Case 



We conclude this section with a brief discussion on conformal prediction bands for 
functional data. When the Xj's are functions, we can construct prediction bands as 
follows. Given C„, a conformal prediction set, we can define upper and lower bounds 
for all t G [0, 1]: i{t) = infxeCn x(t) and u{t) = sup^g^^ x(t). It follows that 

F[i{t) < Xn+i{t) < u{t), V t] > 1 - a. 

However, these bounds may be hard to compute, depending on the choice of confor- 
mity score. 

The choice of conformity score also affects statistical efficiency. To see this, con- 
sider the following example. Under mild conditions on the process, the random vari- 
able Y = supj |X(t)| has finite quantiles. Then we can construct conformal prediction 
set Cn using the standard approach with conformity score g{X) = — sup^ |X(t)|. Thus 
Cn is a valid prediction set and naturally leads to a band. However, such a band is 
usually too wide and hence of limited use. 

We therefore face two challenges: choosing a good conformity score and extracting 



useful information from C„. In the vector setting. Lei et al. (2013) showed that using 



a density estimator to define a conformity score leads to conformal set C„ with certain 
minimax optimality properties. However, in the present setting, densities do not even 
exist. Instead, we consider two approaches: the first uses density of coefficients of 
projections to construct prediction bands and the second uses pseudo-densities to 
build clustering trees. 

There is also a third challenge: identifying an optimal conformity score. This 



was done using minimax theory in Lei et al. (2013); Lei & Wasserman (2013) in the 
case where the data are vectors. To our knowledge, choosing an optimal conformity 
score for the functional case is still an open question. Rather, our current focus is on 
computation and visualization. 
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3 Prediction Bands Based on Projections 



A common approach to functional data analysis is to project the curves onto a finite 
dimensional space that captures the main features of the curves. In this section, we 
consider the projection approach because it enables us to characterize the prediction 
sets and to construct simultaneous prediction bands with finite sample guarantee. To 
be concrete, we require the prediction band C„ to satisfy P(n(X„+i) G C„) > 1 — a 
where 11 is a mapping into a p-dimensional function space flp C fl. There are two 
types of projections: projections on a fixed basis (for example, Fourier basis, wavelet 
basis, and spline basis) and projections on a data-driven basis such as functional 
principal components (FPC). Our method, summarized in Algorithm 2, is general 
enough so that it can be used for any basis. It is a specific implementation of the 
general method given in Algorithm 1. 



Algorithm 2: Functional Conformal Prediction Bands 

Input: Data basis functions (0i,...0p), conformity 

score /, level a, 1 < ni < n. 
Output: Bnit) C for all t E [0, 1]. 

1. Split data randomly into two parts Xi = {Xi, ...X^}, and 
X2 = {Xn-^+i, ...,X„}. Let n2 = n- rii. 

2. Compute basis projection coefficients = {Xi,(f)j), for 
i = 1, ...,n, j = 1, Denote = (^a, ...,^ip). 

3. For i = m + l, evaluate f{^i) = fi^u^i, ...,^„i) and 
rank these numbers by /i < /2 < ... < /„2. 

4. Define Tn = {^eW: /(O > A = 

5. = {EkO0,W : (Ci,...,Cp) e T4. 



Proposition 3.1. Denote Up the projection operator induced by eig en- functions {0j : 
1 < J < p}- Then the prediction hand Bn given by Algorithm 2 satisfies 



P{ (UpXn+i) (t) G Bn{t), V t G [0, 1]} > 1 



a. 



The proof follows from that of Lemma 2.1 



Remark 3.1. The above finite sample coverage guarantee remains valid if the ba- 
sis functions are estimated from Xi, which is the case in most applications where 
functional principal components are used. 
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3.1 Conformal Prediction Bands using Gaussian Mixture Ap- 
proximation 

In general one can apply Algorithm 2 with any basis (possibly a data-driven one 
obtained from Xi). However, in its abstract form, Algorithm 2 leaves some imple- 
mentation issues unsolved. The most challenging one is the characterization of T„ and 
Bn{t) which depend on the choice of conformity score. Usually / is a density estima- 
tor of the projection coefficient vector ^ G since it makes sense to use high density 



region as a prediction set for future observations. For example. Lei et al. (2013) use 
kernel density estimator to construct /. Although kernel density estimators can ap- 
proximate any smooth densities, it is unclear how to keep track of {^^(t>{t) '■ C, G T„}, 
the linear projection of the resulting prediction sets, where 0(t) = (0i(t), (ppit))'^ . 
However, the inductive conformal method allows us to use any other conformity score 
to simplify the computation. Next we show how to use Gaussian mixture density 
estimator to construct such a band. 

To motivate the method, consider the mixture model 

X ~ P = TTiPi + 7I2P2 + ■■■+ T^kPr, 

where each Pj. is the distribution of a Gaussian process on [0, 1] and tt^'s are mixing 
probabilities satisfying ^^vrfc = 1 and vTfc > 0. Let {(pj : j > 1} be an orthonormal 
basis of L2[0,l], and = {X,(f)j) be the scores of X, where {f,g) = J fg. Then 
(^1, ...,^p) is distributed as a p-dimensional Gaussian mixture. For smooth processes, 
the variance of the projection score on dimension j in each component decays quickly 
as j grows. Therefore, it is common in functional data analysis the sequence of 
scores is truncated for some small p. The basis that gives fastest decay corresponds 
to the functional principal components, where (p/s are the eigen-functions of F, the 
covariance function of X: 

F = E[X ® X] - EX ® EX = ^ Xj [(f)j ® (f)j], 

where for functions f,g & L2[0, 1], we define f ® g : [0,1]^ M"*^: [/ ® g\{s,t) = 
f{s)g(t). We emphasize that our band has correct coverage: (1) without assuming 
that the mixture model is correct, and (2) for all choices of projection dimension and 
numbers of mixture components. 

Going back to Algorithm 2, we can choose / to be a Gaussian mixture density 
estimator with K components and the first p eigen-functions of empirical 

covariance function obtained from Xi. Let 7Tk, Jik, be the estimated mixture 
proportion, mean, and covariance of the kth component. Denote ip{-; /i, S) the density 
function of Norm(/i, S). A rough outer bound of T„, the level set of / at A, can be 
obtained by 

K K 

Tn = {^: m > A} C |J{e : y^{^;i2k,^k) > \/{K7Tk)} := \jT^,k. (6) 

k=l k=l 
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Note that each T„ ^ is an elhpsoid in MP whose projection on can be computed 
easily. Let Ukit) = sup^^zT^ ^^^(j){t) and ik(t) = mf^^T„kC,'^4>(t)- Both Uk and ik are 
available in close form. Then we can output Bn{t) = Uk[ik{t),Uk{t)]. 

3.2 Two refinements 

The approximation in ^ is usually conservative and improvements are usually pos- 
sible. We elaborate this idea in two cases. 

A better approximation of the density level set. First, when the components 
in the Gaussian mixture are well separated (/i^'s are far from each other), then in- 
tuitively the level set of / at A shall be approximately the union of level sets of 
ip{-] lik, Sfc) at X/iTk. Now we make this idea precise and give an more refined approx- 
imation of T„ that also retains finite sample coverage guarantee. 

For all 1 < k,s < K, define 

6ks = sup (nk(p{^] ^k, Sfc) A ns(p{^] /i^, ^s)^ , ^k = ^ 4s- 

Roughly speaking, Sks measures how much the two components overlap and Sk mea- 
sures how much the kth component overlaps with all other components. We note 
that 6ks can be computed by solving a sequence of simple quadratically constrained 
quadratic programming (QCQP). More concretely, for a given value of c, we find 
sup^ ip{^; /ifc, Sfc) under the constraint ip{^; /I^, S^) > c, which is a simple QCQP. De- 
note this value by //(c), then it can be shown that 6ks is either trivial {ns(p{fj,; 'fis, ^s) < 
TTkififi] /ifc, Sfc)) or it can be given by the unique c* such that Tf^c* = TTsri{c*). In the 
non-trivial case, c* can be found using a simple binary search. 

When all 4 are smaller than A, we have a better approximation of Tn'. 

K K 

Tn = {i: m > A} C |J{e : <^(e; /I^, %) > (A - 4)/vrfc} := |J f^,k. (7) 



k=l k=l 



Note that T„ ^ is also an ellipsoid. We can similarly compute ik = iiifggf ^ s 
Uk = sup^g^^^^ ^'^4>{t), and the band = Ufc[4,Mfc]- 

Proposition 3.2. The prediction band Bn{t) constructed above is a union of K bands, 
and 

P|3A;: [Ii^{X^+^)] {t) E [4(t), ^^(t)], Vt} > 1 - a. 

We implement this method on a real data example. The data set consists of 1,000 
recordings of neurons over time (see Figure |l|(a)). The data come from a behavioral 
experiment, performed at the Andrew Schwartz motorlatQ A macaque monkey per- 

^ http : //motorlab . neurobio . pitt . edu/ index . php| 
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Neuron Data 



Phoneme Data 





Frequency 



Figure 1: Two functional data sets, (a): neuron data, (b): phoneme data. 

forms a center-out and out-center target reaching task with 26 targets in a virtual 
3D environment. The curves show voltage of neurons versus times recorded at elec- 
trodes. The recorded neural activity consists of all action potentials detected above 
a channel-specific threshold on a 96-channel Utah array implanted in the primary 
motor cortex. One of the goals is "spike sorting" which means clustering the curves. 
Each cluster is thought to correspond to one neuron since each neuron tends to have 
a characteristic curve (spike). 

Our implementation uses functional principal components with p = 2 and K = 3. 
A band consists of three components is plotted along with the projected sample 
curves. The empirical coverage is 916 out of 1000. See Figure [2] 



A different conformity score. The approximation introduced above may still be 
too conservative when the mixture components are not well separated. An example 
of such a situation can be seen from the phoneme data, which is considered in [Hastie 
et al. ( 2009| ) and |Ferraty fc Vieu (2006). The data considered here consists of three 



phonemes, where each phoneme has 400 sample curves of discretized periodograms 
of length 150|^ The data is plotted in Figure [l] (b). 

When analyzing the phoneme data, we also perform a FPCA and focus on the first 
two principal component scores. Note that our method can be implemented using 
other variants such as robust FPCA and using more dimensions. Here we choose 
two principal components for the ease of visualization. In the original data, the 
label of each curve is known and we summarize this information in Figure |3] (a) (but 
our algorithm assumes that the labels are unknown). We observe that the cluster 



^Further information and the data set can be found at (http://www.math.univ-toulouse.fr/ 
[s t aph/npf da/npf da- phoneme - de s . pdf J . 
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Figure 2: Prediction Bands for Neuron Data, (a): plots of first two FPC coefficients 
with boundary of T„,fc. (b): conformal prediction bands and projected curves. 



corresponding to phoneme "del" exhibits significant non-Gaussianity, therefore the 
Gaussian mixture model fitting suggests that this single cluster is best modeled by a 
two-component Gaussian mixture. The band for that cluster is just the union of the 
bands given by the two components. 

In this case, if the approximation in ([T]) is used, the resulting prediction and hence 
the band will be too wide than necessary due to the heavy overlap between the two 
components corresponding to phoneme "del". Here we use a different conformity 
score to obtain a better prediction set. In particular, consider 

/(O = max 7rfc(^(^;/ifc,Sfc). (8) 

k:l<k<K 

That is, instead of computing the sum of density from each component, we only look 
at the density of the cluster that ^ is most likely to belong to. Such a conformity 
score function resembles that of a K-means clustering, except it takes into account of 
the mixture probabilities. 

The advantage of using such a max function is the decomposability: 

K K 

Tn = {^: m > A} = |J{e : f{^;Jik,^k) > A/vr,} := [J f.,^. (9) 

k=l k=l 

Since Tn^k is also an ellipsoid, we can analogously define ik(t) = inf^g-f^ ^ ^(p(t), Ukif) = 
sup^g^„,fe^^0(^)) and = Ufe[4(t), ^^(t)]. 

Remark 3.2. We note that T^^k defined in and Tn,k defined in |^ are very close 
if the components are well-separated because 5^ ~ for all k and f defined in is 
very close to the estimated density. 
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Figure 3: Prediction Bands for Phoneme Data, (a): plots of first two FPC coefficients 
with boundary of T„,fe. (b): conformal prediction bands and projected curves. 

Since the union representation in the above equation is exact, there is no loss of 
statistical efficiency. The only possible loss of efficiency is using a conformity score 
function other than the density function. However, such a loss is conceivably small: 
since if the max contributor is large, the density is likely to be large. Therefore, 
ranking the max component density is roughly the same as ranking the density. The 
prediction band for the phoneme data is plotted in Figure [3] (b) where the empirical 
coverage is 90.5%. 



4 Methods Based on Pseudo-Densities 

Here we investigate a different approach, based on pseudo-densities which were intro- 



duced by Ferraty & Vieu (2006). For simplicity, assume na is an integer. Given a 



kernel K and bandwidth h > define the pseudo-density estimator 

Mu)^'-±k('-^) (10) 



n ^ V ^ 

1=1 ^ 

where d{f,g) is some distance measure (for example, d{f,g) = [/(/ — (7)^]^/^). We 
assume that K{z) < K{0) for all z. This looks like a kernel density estimator but it 
is not a density function. Indeed, there are no density functions on Q because there is 
no (T-finite dominating measure. Nevertheless, Ferraty and Vieu show that Ph{u) can 
be used for various tasks such as clustering curves, just as we would do for a density 
estimator. We can view ph as an estimator of Ph{u) = E,{ph{u)) = J K(^'^^^j^^dP{x). 
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Figure 4: Neuron data. X axis: time. F axis: response, (a) 1,000 recordings of action 
potentials over time, (b-e) anormalies, median, and high density sample curves using 
pseudo-density, (f) mean-shift prototypes. 



We follow a similar approach here, and endow the high density sets in the func- 
tional space with a conformal prediction interpretation. For presentation simplicity, 
we use the standard conformal method. Let Cn,a = {/ : 7r(/) > a} where 

^^■^^ = ' 



It is straightforward to see that Cn,a constructed above is a level 1 — a conformal 
prediction set, and hence has distribution-free, finite sample coverage. The set Cn,a 
can be approximated by a level set of ph as follows. Let X(i),X(2), . . . , denote the 
re-ordered data, ranked so that p/i(X(i)) < pft(X(2)) < ■ ■ ■ . Let A = p/i(X(„Q,)) and let 

C+„ = {/: PH{f)>)^-n-'K{0)}. (11) 

We have the following result ensuring that the level sets of the pseudo-density have a 
well-defined finite sample predictive interpretation. Its proof is analogous to that of 
ordinary kernel density for vectors (see Lei et al. (2013)). 

Lemma 4.1. We have P(X„_|_i G Cn^a) > 1 — a for all P and n. Furthermore, 
Cn,a ^ C^a ^'^^ heuce, P(X„_|_i G C^a) > 1 — a for all P and n. 

Now we describe how to make use of and explore the set C+^. Let C„_q = C+„ fl 
{Xi, . . . , X„}. 
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Anomalies, median set, and high density curves We first consider three such 
sets of Cn,a- anomalies (a close to 0), typical (a = .5) and high density (a close to 
1). Figures [4] (b-e) show the results using the distance d?{f,g) = J{f{t) — g{t)Ydt 
for the neuron data. It is easily seen that the median set successfully captures the 
common shape of each group of neurons. The high density set contains curves in the 
two larger groups. The anomalies consist of mostly curves with irregular shape. 



Modes, prototypes, and mean shift Another summary of Cn,a are prototypes, 
that is, representative functions corresponding to the local maxima of the pseudo 
density (see also the cluster tree below). The modes of the pseudo-density can be 
obtained using the mean-shift algorithm [Cheng (1995). Figure |4](f) shows the three 
modes obtained in our example. Two of the modes have the signature of a neuron 
firing (a decrease followed by a sharp increase). The third mode, which stays negative, 
is unusual and deserves further attention. 



Conformal cluster tree Here we use a cluster tree to visualize how the conformal 
prediction set evolves as a changes smoothly from to 1. For a given e > 0, define the 
graph Ga,e whose nodes correspond to the functions in C„^q, and with edges between 
nodes Xi and Xj if f{Xi(t) — Xj{t)Ydt < e^. Define the level a clusters to be 
the partition of Cn,a induced by connected components of Ga,f As the conformal 
parameter a varies from to 1, the collection T of all level a clusters form a tree (i.e. 
A, B ^ T implies that Ar\B = ^oTAGBoTB<zA), which we call the conformal 
tree. The height of the tree is indexed by a, with the root of tree indexed by o; = 0, 
corresponding to all the points in the dataset. As a increases the sets Cn,a becomes 
smaller, and the leaves, which are associated to local modes of the pseudo density, 
consist each of a single data point. Such a conformal tree is similar to an ordinary 
clustering tree, plus the additional feature of finite sample coverage and a different 
but quite natural indexing on a rather than the usual indexing on the density itself. 

The conformal tree provides a graphical representation of some distributional 
properties of the data and, in particular, of the "high- density" regions. As such 
it allows us to identify salient features about the data that may be otherwise hard to 
detect. Figure [Sj^a) shows the conformity tree of the neuron data based on the pseudo 
density estimator described above using the L2 distance and the Gaussian kernel. It 
is easy to see how the data appear to arise from a mixture of three components (com- 
pare with Figure [2]), with splits occurring at values of a equal to approximately 0.08 
and 0.24. A further, more subtle, split occurs at a = 0.79, and distinguishes between 
two groups of curves exhibiting very similar but ultimately distinct behavior. The 
leaves of the tree correspond to prototypical curves of highest pseudo density within 
each component. In Figure [sj^b) we present the conformal tree with h = e = 1000 as 
in Figure |4j The tree is different but suggests strongly the three-component structure. 
It also rises the issue of choosing tuning parameters, an important problem for future 
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(a) Bandwidth = 3272 (b) Bandwidth = 1000 




Figure 5: Conformal trees for the neuron data, (a): conformal tree with band- 
width h chosen to maximize the variance of p/i(Xi), ...,pft(X„) and e set to 0.5 x 
max{min^^ d{Xi,Xj)}. (b): conformal tree obtained using: h = e = 1000. 
The plots show the clusters for different values of a. Next to each node (e.g., split), 
we plot the curves belonging to the elements of the corresponding partition of Cn,a- 
Isolated clusters of size smaller than 10 are ignored. 



study. 



5 Conclusion and Future Directions 

We have extended conformal methods and its variants to functional data. Using the 
inductive conformal predictor, together with basis projections, we obtain the first 
distribution-free simultaneous prediction bands for functional data with finite sample 
performance guarantee. These prediction bands can also be viewed as quantile sets of 
the underlying process that corresponding to high density region. On the other hand, 
using standard conformal prediction with pseudo densities, the prediction set Cn also 
reveals hierarchical and salient features in the data. We have proposed methods for 
extracting information from and visualizing C„. Currently, we are investigating several 
open questions such as extension of the new conformal method to other conformity 
scores in high dimensional problems and issues regarding optimality and the selection 
of tuning parameters. For functional data, the choice of distance in pseudo densities 
is also interesting. It can be shown that for the L2 distance, the induced bands are 
unbounded. The results using the distance d?{f,g) = Ylij^'^KPj " ^jY example 
are somewhat different (not shown). Here, and {9j} are the coefficients / and 
g in the cosine basis. We call this the analytic distance as it is related to the set 



of analytic functions; see page 46 of Efromovich (1999). This distance is much more 
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sensitive to the shape of the curve. 
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